34 research outputs found

    Content-Based Weak Supervision for Ad-Hoc Re-Ranking

    Full text link
    One challenge with neural ranking is the need for a large amount of manually-labeled relevance judgments for training. In contrast with prior work, we examine the use of weak supervision sources for training that yield pseudo query-document pairs that already exhibit relevance (e.g., newswire headline-content pairs and encyclopedic heading-paragraph pairs). We also propose filtering techniques to eliminate training samples that are too far out of domain using two techniques: a heuristic-based approach and novel supervised filter that re-purposes a neural ranker. Using several leading neural ranking architectures and multiple weak supervision datasets, we show that these sources of training pairs are effective on their own (outperforming prior weak supervision techniques), and that filtering can further improve performance.Comment: SIGIR 2019 (short paper

    Pseudo Label Is Better Than Human Label

    Full text link
    State-of-the-art automatic speech recognition (ASR) systems are trained with tens of thousands of hours of labeled speech data. Human transcription is expensive and time consuming. Factors such as the quality and consistency of the transcription can greatly affect the performance of the ASR models trained with these data. In this paper, we show that we can train a strong teacher model to produce high quality pseudo labels by utilizing recent self-supervised and semi-supervised learning techniques. Specifically, we use JUST (Joint Unsupervised/Supervised Training) and iterative noisy student teacher training to train a 600 million parameter bi-directional teacher model. This model achieved 4.0% word error rate (WER) on a voice search task, 11.1% relatively better than a baseline. We further show that by using this strong teacher model to generate high-quality pseudo labels for training, we can achieve 13.6% relative WER reduction (5.9% to 5.1%) for a streaming model compared to using human labels.Comment: 6 pages, 2 figures, 9 tables, submitted to INTERSPEEC

    Fix it where it fails: Pronunciation learning by mining error corrections from speech logs

    Full text link
    The pronunciation dictionary, or lexicon, is an essential component in an automatic speech recognition (ASR) system in that incorrect pronunciations cause systematic misrecognitions. It typically con-sists of a list of word-pronunciation pairs written by linguists, and a grapheme-to-phoneme (G2P) engine to generate pronunciations for words not in the list. The hand-generated list can never keep pace with the growing vocabulary of a live speech recognition sys-tem, and the G2P is usually of limited accuracy. This is especially true for proper names whose pronunciations may be influenced by various historical or foreign-origin factors. In this paper, we pro-pose a language-independent approach to detect misrecognitions and their corrections from voice search logs. We learn previously un-known pronunciations from this data, and demonstrate that they sig-nificantly improve the quality of a production-quality speech recog-nition system. Index Terms — speech recognition, pronunciation learning, data extraction, logistic regression 1

    Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

    Full text link
    Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks are expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alternative paradigm where a small set of parameters are updated to adapt the foundation model to new tasks. However, these methods still suffer from a high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In the paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on speech recognition task than existing algorithms with fewer number of trainable parameters, less computational memory cost and faster training speed. After combining with Adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with 97%97\% fewer trainable encoder parameters and 53%53\% faster training speed
    corecore